# Natural Language Processing

## Overview
This skill provides comprehensive tools for building NLP applications using modern transformer models (BERT, GPT, and related architectures) as well as classical NLP techniques, covering text classification, named entity recognition, sentiment analysis, and more.
## When to Use
- Building text classification systems for sentiment analysis, topic categorization, or intent detection
- Extracting named entities (people, places, organizations) from unstructured text
- Implementing machine translation, text summarization, or question answering systems
- Processing and analyzing large volumes of textual data for insights
- Creating chatbots, virtual assistants, or conversational AI applications
- Fine-tuning pre-trained transformer models for domain-specific NLP tasks
## NLP Core Tasks

| Task | Description |
| --- | --- |
| Text Classification | Sentiment, topic, and intent classification |
| Named Entity Recognition | Identifying people, places, and organizations |
| Machine Translation | Translating text between languages |
| Text Summarization | Condensing documents to their key information |
| Question Answering | Finding answers in documents |
| Text Generation | Producing coherent text (see the sketch below) |
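Text generation is the one task in this table that the implementation below does not exercise. A minimal sketch using the Hugging Face pipeline API; GPT-2 is an assumed checkpoint chosen for illustration, and any causal language model on the Hub works the same way:

```python
from transformers import pipeline

# Small causal LM, used purely for illustration
generator = pipeline("text-generation", model="gpt2")
result = generator("Natural language processing is", max_new_tokens=30)
print(result[0]["generated_text"])
```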
## Popular Models and Libraries

| Library | Description |
| --- | --- |
| Transformers | BERT, GPT, RoBERTa, T5 |
| spaCy | Industrial-strength NLP pipelines |
| NLTK | Classic NLP toolkit |
| Hugging Face | Hub of pre-trained models |
| PyTorch/TensorFlow | Deep learning frameworks |

## Python Implementation

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from collections import Counter
import re
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
import torch
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    AutoModelForTokenClassification,
    pipeline,
    TextClassificationPipeline,
)
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
import warnings

warnings.filterwarnings('ignore')
# Download required NLTK resources (stopwords and wordnet are used below,
# so they must be downloaded along with punkt)
for path, resource in [('tokenizers/punkt', 'punkt'),
                       ('corpora/stopwords', 'stopwords'),
                       ('corpora/wordnet', 'wordnet')]:
    try:
        nltk.data.find(path)
    except LookupError:
        nltk.download(resource)

print("=== 1. Text Preprocessing ===")

def preprocess_text(text, remove_stopwords=True, lemmatize=True):
    """Complete text preprocessing pipeline."""
    # Lowercase
    text = text.lower()
    # Remove special characters and digits
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    # Tokenize
    tokens = word_tokenize(text)
    # Remove stopwords
    if remove_stopwords:
        stop_words = set(stopwords.words('english'))
        tokens = [t for t in tokens if t not in stop_words]
    # Lemmatize
    if lemmatize:
        lemmatizer = WordNetLemmatizer()
        tokens = [lemmatizer.lemmatize(t) for t in tokens]
    return tokens, ' '.join(tokens)

sample_text = "The quick brown foxes are jumping over the lazy dogs! Amazing performance."
tokens, processed = preprocess_text(sample_text)
print(f"Original: {sample_text}")
print(f"Processed: {processed}")
print(f"Tokens: {tokens}\n")
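
# (Illustrative addition) PorterStemmer is imported above but never used;
# for comparison, stemming truncates more aggressively than lemmatization
stemmer = PorterStemmer()
print(f"Stemmed: {[stemmer.stem(t) for t in tokens]}\n")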
# 2. Text classification with sklearn
print("=== 2. Traditional Text Classification ===")

# Sample data (1: positive, 0: negative)
texts = [
    "I love this product, it's amazing!",
    "This movie is fantastic and entertaining.",
    "Best purchase ever, highly recommended.",
    "Terrible quality, very disappointed.",
    "Worst experience, waste of money.",
    "Horrible service and poor quality.",
    "The food was delicious and fresh.",
    "Great atmosphere and friendly staff.",
    "Bad weather today, very gloomy.",
    "The book was boring and uninteresting.",
]
labels = [1, 1, 1, 0, 0, 0, 1, 1, 0, 0]

# TF-IDF vectorization
tfidf = TfidfVectorizer(max_features=100, ngram_range=(1, 2))
X_tfidf = tfidf.fit_transform(texts)

# Train classifier
clf = MultinomialNB()
clf.fit(X_tfidf, labels)

# Evaluate (in-sample: predictions are made on the training data,
# so these scores are optimistic; use a held-out split in practice)
predictions = clf.predict(X_tfidf)
print(f"Accuracy: {accuracy_score(labels, predictions):.4f}")
print(f"Precision: {precision_score(labels, predictions):.4f}")
print(f"Recall: {recall_score(labels, predictions):.4f}")
print(f"F1: {f1_score(labels, predictions):.4f}\n")
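
# (Illustrative addition) Classify an unseen sentence with the fitted
# vectorizer and classifier; the example string is arbitrary
new_text = ["Absolutely wonderful, would buy again!"]
print(f"New text sentiment (1=positive): {clf.predict(tfidf.transform(new_text))[0]}\n")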
# 3. Transformer-based text classification
print("=== 3. Transformer-based Classification ===")
try:
    # Use Hugging Face transformers for sentiment analysis
    sentiment_pipeline = pipeline(
        "sentiment-analysis",
        model="distilbert-base-uncased-finetuned-sst-2-english",
    )
    test_sentences = [
        "This is a wonderful movie!",
        "I absolutely hate this product.",
        "It's okay, nothing special.",
        "Amazing quality and fast delivery!",
    ]
    print("Sentiment Analysis Results:")
    for sentence in test_sentences:
        result = sentiment_pipeline(sentence)
        print(f"  Text: {sentence}")
        print(f"  Sentiment: {result[0]['label']}, Score: {result[0]['score']:.4f}\n")
except Exception as e:
    print(f"Transformer model not available: {str(e)}\n")
# 4. Named entity recognition (NER)
print("=== 4. Named Entity Recognition ===")
try:
    # distilbert-base-uncased has no token-classification head; use a
    # checkpoint fine-tuned for NER instead
    ner_pipeline = pipeline("ner", model="dslim/bert-base-NER", aggregation_strategy="simple")
    text = "Apple Inc. was founded by Steve Jobs in Cupertino, California."
    entities = ner_pipeline(text)
    print(f"Text: {text}")
    print("Entities:")
    for entity in entities:
        print(f"  {entity['word']}: {entity['entity_group']} (score: {entity['score']:.4f})")
except Exception as e:
    print(f"NER model not available: {str(e)}\n")
# 5. Word embeddings and similarity
print("\n=== 5. Word Embeddings and Similarity ===")
from sklearn.metrics.pairwise import cosine_similarity

# Simple bag-of-words embeddings
vectorizer = CountVectorizer(max_features=50)
docs = [
    "machine learning is great",
    "deep learning uses neural networks",
    "machine learning and deep learning",
]
embeddings = vectorizer.fit_transform(docs).toarray()

# Compute pairwise cosine similarity
similarity_matrix = cosine_similarity(embeddings)
print("Document Similarity Matrix:")
print(pd.DataFrame(
    similarity_matrix,
    columns=[f"Doc {i}" for i in range(len(docs))],
    index=[f"Doc {i}" for i in range(len(docs))],
).round(3))
# 6. Tokenization and vocabulary
print("\n=== 6. Tokenization Analysis ===")
corpus = " ".join(texts)
tokens, _ = preprocess_text(corpus)

# Vocabulary
vocab = Counter(tokens)
print(f"Vocabulary size: {len(vocab)}")
print("Top 10 most common words:")
for word, count in vocab.most_common(10):
    print(f"  {word}: {count}")
# 7. Advanced transformer pipelines
print("\n=== 7. Advanced NLP Tasks ===")
try:
    # Zero-shot classification
    zero_shot_pipeline = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
    sequence = "Apple is discussing the possibility of acquiring a startup for 1 billion dollars"
    candidate_labels = ["business", "sports", "technology", "politics"]
    result = zero_shot_pipeline(sequence, candidate_labels)
    print("Zero-shot Classification Results:")
    for label, score in zip(result['labels'], result['scores']):
        print(f"  {label}: {score:.4f}")
except Exception as e:
    print(f"Advanced pipeline not available: {str(e)}\n")
# 8. Text statistics and analysis
print("\n=== 8. Text Statistics ===")
sample_texts = [
    "Natural language processing is fascinating.",
    "Machine learning enables artificial intelligence.",
    "Deep learning revolutionizes computer vision.",
]
stats_data = []
for text in sample_texts:
    words = text.split()
    chars = len(text)
    avg_word_len = np.mean([len(w) for w in words])
    stats_data.append({
        'Text': text[:40] + '...' if len(text) > 40 else text,
        'Words': len(words),
        'Characters': chars,
        'Avg Word Len': avg_word_len,
    })
stats_df = pd.DataFrame(stats_data)
print(stats_df.to_string(index=False))
# 9. Visualization
print("\n=== 9. NLP Visualization ===")
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Word frequency
word_freq = vocab.most_common(15)
words, freqs = zip(*word_freq)
axes[0, 0].barh(range(len(words)), freqs, color='steelblue')
axes[0, 0].set_yticks(range(len(words)))
axes[0, 0].set_yticklabels(words)
axes[0, 0].set_xlabel('Frequency')
axes[0, 0].set_title('Top 15 Most Frequent Words')
axes[0, 0].invert_yaxis()

# Sentiment distribution
sentiments = ['Positive', 'Negative', 'Positive', 'Negative', 'Positive']
sentiment_counts = Counter(sentiments)
axes[0, 1].pie(sentiment_counts.values(), labels=sentiment_counts.keys(),
               autopct='%1.1f%%', colors=['green', 'red'])
axes[0, 1].set_title('Sentiment Distribution')

# Document similarity heatmap
im = axes[1, 0].imshow(similarity_matrix, cmap='YlOrRd', aspect='auto')
axes[1, 0].set_xticks(range(len(docs)))
axes[1, 0].set_yticks(range(len(docs)))
axes[1, 0].set_xticklabels([f'Doc {i}' for i in range(len(docs))])
axes[1, 0].set_yticklabels([f'Doc {i}' for i in range(len(docs))])
axes[1, 0].set_title('Document Similarity Heatmap')
plt.colorbar(im, ax=axes[1, 0])

# Text length distribution
text_lengths = [len(t.split()) for t in texts]
axes[1, 1].hist(text_lengths, bins=5, color='coral', edgecolor='black')
axes[1, 1].set_xlabel('Number of Words')
axes[1, 1].set_ylabel('Frequency')
axes[1, 1].set_title('Text Length Distribution')
axes[1, 1].grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.savefig('nlp_analysis.png', dpi=100, bbox_inches='tight')
print("\nNLP visualization saved as 'nlp_analysis.png'")
# 10. Summary
print("\n=== NLP Summary ===")
print(f"Texts processed: {len(texts)}")
print(f"Unique vocabulary: {len(vocab)} words")
print(f"Average text length: {np.mean([len(t.split()) for t in texts]):.2f} words")
print(f"Classification accuracy: {accuracy_score(labels, predictions):.4f}")
print("\nNatural language processing setup completed!")
```
## Common NLP Tasks and Models

| Task | Recommended Models |
| --- | --- |
| Classification | DistilBERT, RoBERTa, ELECTRA |
| NER | BioBERT, SciBERT, spaCy models |
| Translation | MarianMT, M2M-100 |
| Summarization | BART, Pegasus, T5 |
| QA | BERT, RoBERTa, DeBERTa |

## Text Preprocessing Pipeline

1. Lowercasing and cleaning
2. Tokenization
3. Stopword removal
4. Lemmatization/stemming
5. Vectorization

## Best Practices

- Use pre-trained models when available
- Fine-tune on task-specific data (see the Trainer sketch below)
- Handle out-of-vocabulary words
- Batch process for efficiency
- Monitor models for bias

## Deliverables

- Trained NLP model
- Text classification results
- Extracted named entities
- Performance metrics
- Visualization dashboard
- Inference API (see the FastAPI sketch below)
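The translation, summarization, and QA rows in the table above map directly onto the Hugging Face pipeline API. A minimal sketch follows; the specific checkpoints (Helsinki-NLP/opus-mt-en-de, facebook/bart-large-cnn, deepset/roberta-base-squad2) are assumptions drawn from the model families in the table, not prescribed by this skill:

```python
from transformers import pipeline

# Translation: English -> German with a MarianMT checkpoint
translator = pipeline("translation_en_to_de", model="Helsinki-NLP/opus-mt-en-de")
print(translator("Natural language processing is fascinating.")[0]["translation_text"])

# Summarization with BART
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
long_text = (
    "Natural language processing combines linguistics and machine learning to let "
    "computers read, interpret, and generate human language. Modern systems rely on "
    "large pre-trained transformer models that are fine-tuned for specific tasks."
)
print(summarizer(long_text, max_length=40, min_length=10)[0]["summary_text"])

# Extractive question answering
qa = pipeline("question-answering", model="deepset/roberta-base-squad2")
result = qa(question="Who founded Apple?",
            context="Apple Inc. was founded by Steve Jobs in Cupertino.")
print(f"{result['answer']} (score: {result['score']:.4f})")
```

For "fine-tune on task-specific data", a compressed sketch of the standard Hugging Face Trainer loop; the IMDB dataset, subsample sizes, and hyperparameters here are placeholders rather than a tuned recipe:

```python
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

# Tiny illustrative setup: binary sentiment classification on a public dataset
dataset = load_dataset("imdb")
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

encoded = dataset.map(tokenize, batched=True)
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)

args = TrainingArguments(
    output_dir="./finetune-out",
    num_train_epochs=1,
    per_device_train_batch_size=16,
    learning_rate=2e-5,
)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=encoded["train"].shuffle(seed=42).select(range(1000)),  # subsample for speed
    eval_dataset=encoded["test"].select(range(500)),
)
trainer.train()
print(trainer.evaluate())
```

For the "Inference API" deliverable, one minimal shape is a FastAPI wrapper around a pipeline; the framework choice and endpoint schema here are assumptions, not requirements:

```python
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import pipeline

app = FastAPI()
# Load the model once at startup; reused across requests
sentiment = pipeline("sentiment-analysis",
                     model="distilbert-base-uncased-finetuned-sst-2-english")

class TextIn(BaseModel):
    text: str

@app.post("/sentiment")
def classify(payload: TextIn):
    result = sentiment(payload.text)[0]
    return {"label": result["label"], "score": float(result["score"])}

# Run with: uvicorn app:app --reload  (assuming this file is saved as app.py)
```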